Neuroimaging feature extraction using a neural network classifier for imaging genetics

Background Dealing with the high dimension of both neuroimaging data and genetic data is a difficult problem in the association of genetic data to neuroimaging. In this article, we tackle the latter problem with an eye toward developing solutions that are relevant for disease prediction. Supported by a vast literature on the predictive power of neural networks, our proposed solution uses neural networks to extract from neuroimaging data features that are relevant for predicting Alzheimer’s Disease (AD) for subsequent relation to genetics. The neuroimaging-genetic pipeline we propose is comprised of image processing, neuroimaging feature extraction and genetic association steps. We present a neural network classifier for extracting neuroimaging features that are related with the disease. The proposed method is data-driven and requires no expert advice or a priori selection of regions of interest. We further propose a multivariate regression with priors specified in the Bayesian framework that allows for group sparsity at multiple levels including SNPs and genes. Results We find the features extracted with our proposed method are better predictors of AD than features used previously in the literature suggesting that single nucleotide polymorphisms (SNPs) related to the features extracted by our proposed method are also more relevant for AD. Our neuroimaging-genetic pipeline lead to the identification of some overlapping and more importantly some different SNPs when compared to those identified with previously used features. Conclusions The pipeline we propose combines machine learning and statistical methods to benefit from the strong predictive performance of blackbox models to extract relevant features while preserving the interpretation provided by Bayesian models for genetic association. Finally, we argue in favour of using automatic feature extraction, such as the method we propose, in addition to ROI or voxelwise analysis to find potentially novel disease-relevant SNPs that may not be detected when using ROIs or voxels alone.


Background
Brain imaging genomic studies have great potential for better understanding psychopathology and neurodegenerative disorders. While high-throughput genotyping technology can determine high-density genetic markers single nucleotide polymorphisms (SNPs), neuroimaging technology provides a great level of detail of brain structure and function [1]. Various modalities of brain imaging can be used to generate meaningful biological information that can in turn be used to evaluate how genetic variation influences disease and cognition. In Alzheimer's disease (AD), structural modalities such as magnetic resonance imaging (MRI) can detect the presence of neuronal cell loss and gray matter atrophy, both indicators of neurodegeneration. Such neuroimaging phenotypes are attractive because they are closer to the biology of genetic function than clinical diagnosis [2].
Imaging genetic data analysis is a statistically challenging task due to the high dimension of both the neuroimages and genetic data. Further increasing the challenge is the fact that the data can be of multiple forms; neuroimages can be collected in multiple formats, e.g. MRI, Computerised Tomography (CT), Positron Emission Tomography (PET) using different machines and in different institutions. Consequently, it is important to find a general solution to the dimension problem that is applicable on a wide range of data structure, which is what we propose in this manuscript.
We consider studies having an emphasis on exploring the relation between genetic variation and brain imaging from structural modalities such as MRI and consider associated statistical methodology for dimension reduction and genetic variable selection. We focus our effort on the identification of SNPs that are potentially related to disease, for example, AD, with brain imaging endophenotypes which have the potential to provide additional structure related to the underlying etiology of the disease. Existing approaches for such analysis are based on considering the imaging data through a specific set of regions of interest (ROIs) (see, e.g., [3][4][5][6][7]) or they are based on a full voxelwise analysis with statistical models fit at each voxel (see, e.g., [8][9][10][11][12]).
The first approach for statistical analysis in studies of imaging genetics developed brain-wide and genome-wide mass univariate analyses [9]. A drawback of this framework is that it ignores linkage disequilibrium and the associated multicollinearity between genetic markers as well as dependence between the components of the imaging phenotype. Hibar et al. [8] employed gene-based dimensionality reduction to avoid collinearity of SNP vectors. Vounou et al. [6] employed sparse procedures based on reduced-rank regression while Ge et al. [11] considered multi-locus interactions and developed kernel machine approaches. A review of methods is provided by Nathoo et al. [13].
Bayesian joint modelling combining imaging, genetic and disease data has been considered in [14] and [15]. The proposed joint models use logistic regression to relate disease endpoints to imaging-based features and a second regression relates imaging to genetic markers. Spike-and-slab selection is employed in both regression components of the joint model. Hierarchical models accounting for spatial dependence in the imaging phenotype using Markov random fields have been developed in [16] and [17]. Zhu et al. [7] developed a Bayesian reduced rank regression reducing the dimension of the regression coefficient matrix and incorporating a sparse latent factor representation for the covariance matrix of the imaging data based on a gamma process prior. Kundu et al. [18] proposed a semiparametric conditional graphical model for imaging genetics within the context of functional brain connectivity where a Dirichlet process mixture is used for clustering regression coefficients into a modular structure. Azadeh et al. [19] developed a voxelwise Bayesian approach that began by partitioning the brain into ROIs and then fitting multivariate regression models to lower-dimensional projections of the voxel-specific data within each ROI separately and in parallel across ROIs.
We investigate here a new approach for extracting imaging features in either the ROI or the voxelwise setting. Statistical learning approaches for feature construction and dimension reduction have been developed based on a number of approaches such as Gaussian Mixture Models (GMM) [20] and Principal Component Analysis (PCA) [21]. In the former, Chaddad et al. use the assignment weights of GMMs as a set of features while in the latter the low-dimension projection of PCA plays the role of extracted features. The ability of neural networks (NNs) to effectively reduce the dimension of large data has been known for some time [22]. Since then, NNs have been at the foundation of multiple feature extraction models [23][24][25] in image analysis. The autoencoder (AE) is a commonly used NN model for feature extraction [26,27]. It consists of two pieces, an encoder and a decoder. The former compresses the data, embedding it within a lower-dimensional representation, while the latter decompresses this representation to its original dimension. Both of these components are optimized simultaneously so as to reduce the reconstruction error. The encoder and the decoder can take various forms but we will assume both are NNs.
Predicting a diagnosis successfully using NNs is also supported by a large literature [28][29][30][31][32][33][34] that has demonstrated that various modern neural network architectures, such as Convolutional Neural Networks (CNNs) [35][36][37], weighted probabilistic neural networks [38] and ensembles of deep neural networks [36,39] can achieve extremely high accuracy in the classification of MRI and PET scans. Shen et al. [40] present a thorough review of early applications of deep learning in medical imaging. Specifically within the context of imaging genetics, Ning et al. [41] were among the first to apply NN approaches. Their approach was to train a NN taking both imaging data and genetic markers as inputs to predict a binary disease response (AD diagnosis).
In the manuscript, we first present a novel three-step imaging genetic pipeline: image processing, feature extraction and finally genetic inference. This separates the pieces where we do not require strong interpretability such as image processing and feature extraction from the pieces where we do need interpretability, namely in genetic inference. Then, we argue in favour of using a prediction model for the feature extraction step. Finally, we implement a simple version of the proposed pipeline as a proof of concept and discuss our findings.
This separation is beneficial for multiple reasons. First, it allows us to utilize the increased prediction accuracy of blackbox models for feature extraction without suffering from their drawbacks such as the lack of interpretability of these models or their inability to provide us with rigorous confidence intervals or anything statistically equivalent. Additionally, it is easy to modify and improve the three pieces individually, making this pipeline applicable to a wide range of data structures. This is central to our contribution because what we propose is a general approach exemplified with a specific implementation of the approach. This way, our proposed pipeline is applicable to a wide range of imaging data and can be constructed with the latest state-of-the-art models.
Consequently, the novelty of our pipeline lies in how we utilize well-established models altogether so that the resulting SNP selection has greater meaning and relevance for disease while the imaging features are nonlinear representations that are otherwise not attainable through standard voxelwise and ROI based imaging genetic analysis. Using a classification model for feature extraction ensures that the lower-dimensional representation, the extracted features, is relevant in predicting the neurological disease of interest. A popular NN architecture for feature extraction is the AE. However, there is no way to guarantee that the lower-dimension representation is correlated with the disease of interest. Using a NNC to extract features is a way to combine the strength of AEs for producing low-dimensional representations with the high predictive accuracy of NNCs to extract features relevant to disease diagnosis. Those features are subsequently related to genetics using a Bayesian inference model accounting for grouping of regression coefficients within SNPs and within genes.
We demonstrate that it is possible to achieve higher prediction accuracy to classify disease status (AD relative to normal controls (NC)) when using NNC features compared with features used previously in the literature based on known AD ROIs. This improvement in classification accuracy could be made even larger by using more sophisticated models but this is outside of the scope of this manuscript where our focus is imaging genetics. We do not argue in favour of a specific model for image classification but rather in favour of using classification models for feature extraction. Consequently, what we propose is a general approach where the classification model can be changed depending on the task at hand.
The rest of the paper proceeds as follows. We introduce our proposed pipeline in Sect. 2. Then, in Sect. 3 we discuss our experimental testing setup and an implementation of the proposed approaches with ADNI data. Section 4 presents our findings on a case example. Finally, Sect. 5 concludes with a discussion about our experimental results, implications of the findings and possible extensions.

Concept
Based on the premise that neuroimaging data is a better representation of the phenotype of interest than clinical diagnostics, we aim at capturing genetic variations related to the disease by directly considering the brain structure. Due to the high-dimensionality of neuroimaging, we propose NNs to extract features related to disease while simultaneously reducing data's dimensionality.
We assume that the natural generation of data follows the premise that genotype is related to brain structure that in turn is related to disease as explained in [42], sequentially in that order. Our framework thus reverses this process which, while clearly an oversimplification, provides a useful mechanism for thinking about data analysis and SNP selection.
The automated disease-relevant feature extraction is based on training a classification model on the imaging data with the disease diagnostic variable as output. Without loss of generality, we propose a NN, without specifically proposing an architecture at this moment. The neurons of the second to last layer of this NN prediction function act as the features extracted by the model. Because the NN is optimized to predict disease diagnosis as accurately as possible using the image data, those neurons are in fact the variables constructed from the images that are the most appropriate to predict the disease and are consequently features relevant for SNP selection. An alternative, which we make comparisons to in our test analysis are features extracted from known disease regions using expert knowledge.

Formal definition
Let v n,m denote voxel m ∈ {1, . . . , M} for subject n ∈ {1, . . . , N } and v n denote the complete imaging data for subject n. We identify with v * n the processed image for subject n. Here, the processed images may take on different forms but v * n is some standardized image data that the prediction model f takes as input. The processing might only involve image registration in its simplest form or it might involve the extraction of volumetric and cortical thickness statistics using FreeSurfer for instance. Then, y n is the disease phenotype for subject n which can be binary or multi-class categorical. We further let g n,s denote the genetic variant s ∈ {1, . . . , S} for subject n so that g n is the genetic data for subject n.
Let h be the image processing function which takes the images v as input and outputs v * . We define as f the classification function which takes v * , the processed imaging data, as input and outputs y, the disease phenotype. We define f as a NN function composed of L layers each identified as f l : l ∈ (1, L) , f 1 being the input layer of f L the output layer. Each layer l may have a different number of neurons x, say K l . In our current parametrization, the output layer is a K L -dimensional vector, o , where o n,k =P(y n = k) , the predicted probability that subject n belongs to class k. After training the neural network f, we fit a statistical model, p, which has the genetic data g as explanatory variable and the neurons of the second to last layer of f, f L−1 , as response.
A detailed representation of the proposed pipeline is shown in Fig. 1. All components previously described are trained as follows: (i) process the raw images v , (ii) train a prediction model, f, of choice by taking the processed images v * as inputs and the diagnosis score as output and (iii) train the inference model p that predicts the features extracted from the prediction model using genetic markers as inputs. The use of a statistical model as our choice of inference model is based on the current availability of interpretable and inference-focused models in the literature.
The proposed approach can be generalized to include various prediction models such as CNNs taking images as inputs or different NN architectures with inputs being the imaging features extracted from commonly used softwares such as FreeSurfer developed by Dale, Fischl and Sereno (see [43,44]). This setup also has the flexibility to easily handle multiple brain imaging modalities which would extract features from, for example EEG, MRI and fMRI using a modular NN with different modules corresponding to different modalities. Similarly, a wide range of inference models can be used and later combined using Bayesian model averaging techniques that account for model uncertainty at the inference stage.

Methods
The aim of this section is to provide readers with a concrete implementation of the proposed pipeline to lay out a test application. It also provides results that highlight the benefits and drawbacks of the proposed approach. In the following test analysis we use disease (AD), MRI and genetic data from the ADNI1 study. FreeSurfer is used for image processing, a simple NN for the prediction model and a multivariate group-sparse Bayesian regression model for SNP selection. Figure 2 provides a visual representation of this simple implementation.
Inference model p Prediction model f Fig. 1 A conceptual representation of the proposed methodology. In this instance, the prediction model f is depicted as a fully connected NN with L layers FreeSurfer quality control and feature construction BGSMTR Fully-connected NN

Fig. 2 Experimental implementation of the proposed pipeline
We compare the prediction accuracy of the 56 volumetric and cortical thickness measurements considered in [5,45], and [46], which include locations of regions of interest such as the hippocampus, cerebellum and ventricles relevant for AD, with features automatically extracted by our proposed technique. We also compare the SNPs identified given those two sets of phenotype features. Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

Cohort of subjects
The cohort of subjects we use in our test application has been previously described by Mirabnahrazam et al. [34]. Briefly, the ADNI1 database has genetic information for 818 subjects. Genotyping information of the ADNI1 subjects was downloaded in PLINK [47] format from the LONI Image Data Archive (https:// ida. loni. usc. edu/). During the genotyping phase, 620,901 SNPs were obtained on the Illumina Human610-Quad Bead-Chip platform. Genomic quality control was conducted using the PLINK software and yielded 521,014 SNPs for 570 subjects. When excluding subjects that had no diagnosis label available, we ended up with 543 subjects for our analysis. The diagnosis values we consider for this experiment are NC, MCI and AD.
In summary, we have a cohort of 543 subjects with 145 NC, 256 MCI and 142 AD. We have T1-weighted baseline MRI scans for every subject as well as 521,014 SNPs.

Image preprocessing
The T1-weighted baseline MRI scans were downloaded from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (n=543). A detailed description of the MRI acquisition protocols can be found on the ADNI website (https:// adni. loni. usc. edu/ metho ds/ docum ents/ mri-proto cols). The T1-weighted images v were then segmented into gray matter (GM), white matter (WM) and cerebrospinal fluid (CSF) tissue compartments using Freesurfer (version 6.0), which is freely available for download (http:// surfer. nmr. mgh. harva rd. edu), and has been described previously [43,44,48]. A standardized quality control procedure was used to manually identify and correct any errors in the automated tissue segmentation in accordance with FreeSurfer's troubleshooting guidelines. Subsequently, cortical GM was parcellated into 68 regions using FreeSurfer's cortical Desikan-Killiany atlas [49] and 62 regions using Freesurfer's Desikan-Killiany-Tourville atlas [50]. Subcortical GM was parcellated into 45 regions using Freesurfer's "aseg" atlas and subcortical WM was parcellated into 70 regions using Freesurfer's "wmparc" atlas [51]. For the white matter parcellation (wmparc), optional Freesurfer parameters were used to ensure the entire white matter compartment was parcellated, (not just WM within a fixed default distance from GM), and any T1 hypotensities were labelled as white matter. This was done to ensure that the white matter parcellation included all white matter voxels and was not biased by individual T1 hypointensity burden. For all other parcellations, the default Freesurfer options were used. From these four parcellations, a total of 1860 features were obtained. These features included: • The volume, mean, standard deviation, min, max, and range of Freesurfer normalized T1 intensity values for the "aseg" (270 total features) and "wmparc" (420 total features) atlas parcellations. • The number of vertices, surface area, gray matter volume, thickness (mean, standard deviation), curvature (mean, Gaussian), folding index, and curvature index for the Desikan-Killiany-Tourville (558 total features) and Desikan-Killiany (612 total features) atlas parcellations.
These 1860 features form v * , the processed image.

Prediction model for feature extraction
We propose a fully connected NN as a prediction model for this simple test application. The inputs of our prediction model are the entirety of the features extracted with FreeSurfer described previously, v * . The output is AD diagnosis, which is a categorical variable for the ADNI1 data base and finally, the second to last layer of this NN are the features we are interested in. In this proposed approach, there is great flexibility to build the early stages of the NN. Specifically, we have control over the number of hidden layers and the non-linear activation function. Assuming the response is a K L -class categorical variable, the output of the NN is a K L -dimensional vector o where o n,k =P(y n = k) which represents the belief that subject n belongs to class k. The relation between the second to last layer and the output layer can be thought of as the one established between predictors and output in a multiclass logistic regression. To do so, we take K L linear combinations of the K L−1 inputs x L−1 , so that o * = Bx L−1 , where B is a K L × K L−1 matrix of coefficients. Then, as activation function, we apply, element-wise, the softmax function to make sure the values are positive and sum to one: .
The model is trained in a similar fashion to a multi-case logistic regression. We minimize the negative log likelihood loss NLLL(o, y) = N n=1 nlll n where nlll n = − K L k=1 log(o n,k )1(y n = k) . This is essentially the equivalent of maximizing the log likelihood of a multinomial distribution. Thus, one could think of the features extracted x L−1 to effectively be one logistic regression away from the disease response, however, these features are constructed from data-driven non-linear functions built from the input.
We use the Python language and the Panda package [52] to import and manipulate the data set. The feature extraction is entirely done using Python. We use the Pytorch package [53] to define and train the NNC. Our NN is a single hidden layer NN with 35 hidden nodes trained with the Adagrad [54] optimizer. Finally, in order to train the NNC to distinguish AD from NC patients and thus to extract features related with the difference between those two groups, we only keep NC and AD during the training of the NNC, thus excluding MCI patients. In other words, we train the NNC on a cohort of 287 subjects (145 NC and 142 AD).
Most of the parameters, such as the number of hidden layers (1), the optimizer (Adagrad), the learning rate (0.01), the learning decay (0) and the number of epochs (350) were selected using cross-validation with the exception of the number of neurons in the hidden layer. We have initially set the number of neurons in the second to last layer to 56 as we wanted to design our model to extract the same number of features as in previous articles [5,45], and [46]. However, reducing its number of neurons to 35 did not decrease the accuracy, so our final set of automatically-extracted features has 35 brain features.

Inference model
The SNPs dimension contrasts with its small fraction expected to be related to the imaging phenotypes. SNPs are connected to traits through various pathways and multiple SNPs on one gene often jointly carry out genetic functionalities. Therefore, it is desirable to develop a model to exploit the group structure of SNPs.
Wang et al. [4] developed Group-Sparse Multi-task Regression and Feature Selection (G-SMuRFS) to perform simultaneous estimation and SNP selection across phenotypes. Consider matrices as boldface uppercase letters and vectors as boldface lowercase letters. Given the SNP data of the ADNI participants as {g 1 , ..., g n } ⊆ R S , where n is the number of participants (sample size), S is the number of SNPs (feature dimensionality), G = [g 1 , ..., g n ] , and the imaging phenotypes as {x 1 , ..., x n } ⊆ R C , C the number of imaging phenotypes, X = [x 1 , ..., x n ] , W being a S × C matrix of regression coefficients, where the entry w ij of the weight matrix W measures the relative importance of the ith SNP in predicting the response of the j-th imaging phenotype, the matrix algebraic mathematical formulation of the regression is: where ||.|| Gr 2,1 is the group l 2,1 -norm, devised by Wang et al. [4]. We recapitulate this norm definition: consider that the SNPs, are partitioned into Q groups � = {π q } Q q=1 , such that, the i-th row of W , {w i } m q i=1 ∈ π q are genetically linked, m q being the number of SNPs in π q . Denote W = [W 1 ...W Q ] T , W q ∈ R m q ×c (1 ≤ q ≤ Q) , then the group l 2,1 -norm can be both defined as While producing sparse point estimates of regression coefficients, the G-SMuRFS lacked standard error computation. Kyung et al. [55] demonstrated that boot-strapping standard error computations preform poorly when the true value of the coefficient is zero, so an equivalent hierarchical Bayesian model was developed in [5]. The hierarchical model takes the form with the coefficients corresponding to different genes assumed conditionally independent and with the prior distribution for each W (q) having a density function that is based on a product of multivariate Laplace kernels This product Laplace density can be expressed as a Gaussian scale mixture which allows for the implementation of Bayesian inference using a standard Gibbs sampling algorithm. The algorithm is implemented in the R package bgsmtr, https:// cran.r-proje ct. org/ web/ packa ges/ bgsmtr/ bgsmtr. pdf which is available for download on the Comprehensive R Archive Network (CRAN). The selection of tuning parameters 1 , 2 in this model requires cross-validation. This model serves as our primary inference model in this test application and we refer to this model by the name of its associated package, BGSMTR.

Results
The framework we propose is designed for the identification of SNPs related to a disease of interest. In the simple implementation provided, we aim at identifying SNPs related to AD. Based on the assumption that neuroimaging features that can accurately predict disease status are more closely related to the disease, we compare the accuracy performances of logistic regression models that take NN-extracted features as inputs to the accuracy of a model that utilizes previously expert-selected features in recent imaging genetics publications such as [5,45], and [46]. For that purpose, we proceed with 50 repetitions of random sub-sampling validation: randomly dividing the data set into a training set and a test set. The training set contains 200 observations while the 73 other observations are assigned to the test set. Compared to k-fold cross-validation, random sub-sampling validation has the benefit of allowing us to fix the size of the training and testing set independently from the number of Monte Carlo samples. Table 1 shows the results. The model trained using the automatically extracted features not only has a significantly higher accuracy (p-value < 0.0001 ) but also has a smaller performance variance across the sub-samples. The better prediction performance suggests that these features are useful for subsequent genetic analysis. More sophisticated prediction models can be investigated in future studies.
To provide an additional perspective of the NN-extracted features and to visually compare them to the features selected based on standard ROIs, we compute a 2-dimensional embedding for both sets of features using a t-distributed stochastic neighbor embedding (t-SNE) as proposed in [56], a t-distributed variant of the original SNE proposed in [57]. Different from PCA that finds a linear representation capturing as much variability as possible, the SNEs proposed in [57] try to identify a low-dimensional representation to optimally preserve a neighborhood identity. A neighborhood-preserving embedding is especially interesting here as the features are extracted to carry information about the disease status of the patient. Figures 3 and 4 contain the embeddings of the training cohort containing strictly the NC and AD patients. A randomly selected neighborhood in Fig. 3 is more likely to have a high concentration of one class compared to a randomly selected neighborhood in Fig. 4.
To begin the genetic analysis, we follow the recommendations found in [17,46] and adjust for subject specific factors by fitting univariate least squares linear regression for every feature (both NN-derived and ROI-based features) onto the age, gender, education level, the APOE genotype and the total intracranial volume. The residuals from each regression are then used as the adjusted imaging response in the inference model.
We then proceed with a two-step process to reduce the number of SNPs selected. First, we reduce the large number of SNPs to a smaller subset of 485 potentially related with AD SNPs [5] based on expert advice. Second, we fit univariate models between every feature and every SNP and keep the top 100 SNPs based on the resulting p-values [58,59]. We rank the SNPs by their smallest p-value, among all models they are included in. Table 2 contains the top 20 SNPs extracted using univariate regression as explained above. Status, novel or known, is checked against two previous publications, [4] and [5]. By comparing the SNPs associated with both sets of features, we first notice the top 3 SNPs are quite similar and that overall many SNPs belong to both groups.  Additionally, we notice that the genes are also quite similar between the two sets of screened SNPs. However, we also identified multiple SNPs that were not identified using the original, expert-based, features. The possibility of identifying additional SNPs based on features that are more predictive of disease is the potential addedvalue of the proposed approach. Thus the NN derived features can be used alongside more standard ROI-based features. These novel SNPs could simply be carrying a gene specific signature but this is also a reason why we rely on a multivariate regression model to determine the final set of SNPs.
For this reason, we follow with a subsequent multivariate regression that will better allow us to distinguish between association with the features and confounding SNPs. We use the 100 screened SNPs as predictors in our inference model, the BGSMTR model described earlier. Table 3 contains the top 20 SNPs ranked by the posterior standard score: the posterior mean divided by the posterior standard deviation. In this table we see again a mix of novel and known SNPs and once again, the status, novel or known, is checked against two previous publications [4,5]. Among other, identifying the association with AD through MRI features of SNPs rs1699105, rs1699105, rs2025935 and rs12209631 to name a few is consistent with previous publications [4,5]. Since half the SNPs identified were identified in previous publications, our approach is consistent with known results and this consistency is very positive in light of the reproducibility of our data-driven approach. Our approach exhibits signs of consistency and reproducibility with past experiments.
On the flip side, if we only discover known SNPs then there is little advantage to our approach. The SNP rs6511720 is ranked very high on the list and was associated with 15 features (according to 95% credible intervals with selection as in [5]). The SNPs rs6457200, rs2243581 and rs3785817 are also ranked high and/or are related with multiple features.

Discussion
The results above provide a strong argument in favour the proposed pipeline which can be used in addition to a standard voxelwise or ROI based imaging genetics analysis. The features extracted are not only better at predicting the neurological disease of interest but more importantly, these features allowed the identification of different SNPs. For instance, we identified the SNP rs6511720, being related with 15 features, and in the meanwhile this SNP was not found to be related with expert-selected features. Therefore, our proposed method could lead to the identification of novel causal SNPs. Furthermore, the extraction process is data-driven and requires no expert advice, outside of the diagnostic. Consequently, we argue in favor of using automatic feature extraction in addition to ROI or voxelwise features to find signal potentially novel SNPs that may not be detected when using ROIs or voxels alone. Our focus here is to identify SNPs related with MRI in a manner that is predictive of disease and obtain confidence intervals and posterior distributions. Integrating machine learning approaches within imaging genetics studies is of potential use as demonstrated in our analysis.
One advantage of the procedure we propose is its flexibility: we can easily improve on each of the three pieces of the pipeline separately. However, a limitation of the study of Sect. 3 is that only a single implementation was tested on a single data set. On the flip side, it opens up possible improvements for future projects. In this first implementation of our proposed pipeline, we use the well-established FreeSurfer software to obtain volumetric and cortical thickness statistics from the MRI scans. We obtain automatically extracted features in a data-driven which have higher predictive power relevant for disease. Thus, it seems reasonable to extend that principle to image processing and also try to automatically process the images in a data-driven way. For instance, a common NNC for images is the Convolutional Neural Networks (CNNs) [35][36][37]. Using a CNN taking as input the 3-dimensional brain scan images and training this model to predict the diagnosis would be of potentially great value for further investigation. The convolutional layers replace some of the image processing steps and the lower-level layers act as the feature extractor. However, some processing, mostly registration, would still be required. As previously demonstrated [28,29], we expect the CNN to provide an even better prediction accuracy and thus features more closely related to AD. Another interesting approach to explore is to use an AE to reduce the dimension of the images in an unsupervised manner first. Different AEs can be trained for each brain region separately and it allows the number of features extracted per region to vary. This allows the collection of AEs to extract more features from regions with higher variability or from regions with more predictive power. Finally, a different perspective for future work would be to model and capture the complex interactions between the neuroimaging data and genetic data using an heterogeneous information networks. Zhao et al. [60] successfully combined different data modalities for drug-disease associations using a graph representation learning model when given a biological heterogeneous information networks.
Additionally, our work demonstrates the use of different objective functions to extract features and reduce the dimension of large observations, such as neuroimages. Instead of using unsupervised models, we are able to direct the feature extraction towards a variable of interest, in our case the disease diagnostic variable which would have otherwise not been used in the analysis relating imaging to genetics. However, with gradient-based models, such as NN, we can design many other objective functions and tailor the feature extraction process for problem-specific needs. This idea can be applied in various ways when we analyse neuroimages, and we recommend considering a large collection of objective functions that are data-driven when extracting features instead of strictly relying on expert advice.
We choose to use a NN for feature extraction, this comes with strengths and weaknesses. Because our goal is to do inference at the SNP level we agreed to lose interpretability on the neuroimage feature level, this is usually considered a weakness of blackbox models such as NNs. In counterparts, this allows us to get nonlinear features that are functions of the complete processed images and the use of classification models ensure that those features are indeed most relavent to AD. The automatic feature extraction approach provides genuine added value when used alongside studies that are conducted at either the ROI or voxelwise level. It requires no external expertise for feature selection and uses disease data that are typically available but are not typically used in such analyses. The features are built considering disease prediction through nonlinear representations of neuroimaging.
Finally, the last step of our pipeline involves an inference step using a multivariate Bayesian group sparse regression. There is scope for generalizing this step to account for model uncertainty where the Bayesian model used is included within a collection of different models (e.g., [7,[14][15][16]18]) and then Bayesian model averaging is used for inference at the SNP level while accounting for model uncertainty. This extension will be explored as part of future work.